Deriving and analyzing Quantitative Disease Risk Scores from Electronic Health Records in the eMERGE Network.
library(ggplot2)
library(knitr)
library(kableExtra)
knitr::opts_chunk$set(echo = TRUE)
This document describes quantitative disease risk scores (QDRSs) based on almost unsupervised methods along with the main features and functionality of the QDRS package, developed for the analysis of Electronic Health Record (EHR) data. The almost unsupervised methods require minimal input from clinicians, can be applied to large datasets, and alleviate some of the main weaknesses of existing phenotyping algorithms. The flow of analysis in an application to Electronic Medical Records and Genomics (eMERGE) network is provided in order to show how the QDRS can be applied to EHRs and genetic data.
The development version of QDRS from Github can be installed and loaded as follows.
# install.packages("devtools")
devtools::install_github("danqingxu/QDRS")
library(QDRS)
# depends on development version of R package gllvm
#devtools::install_github("JenniNiku/gllvm")
Suppose we have \(J\) phenotypic features for a given set of \(n\) subjects; these include phecodes, but can also include laboratory values and other clinical covariates. Each phenotypic feature is centered and scaled by its sample standard deviation in the preprocessing step. The goal is, for a given (complex) disease phenotype, to construct a quantitative disease risk score (QDRS) as a weighted linear combination of multiple phenotypic features, i.e., the score for subject \(i\) is defined as \[ \text{QDRS}_i = \sum_{j=1}^{J} w_j x_{ij}, \] where \(x_{ij}\) is the standardized value of the \(j\)th phenotypic feature for subject \(i\). Here we consider several possible unsupervised methods to derive the set of weights \(w_j\). We denote by \(\mathbf{X}_{n\times J}\) the matrix of \(J\) standardized phenotypic features for the \(n\) subjects, and let \(\mathbf{Q}\) be the \(J\times J\) sample correlation matrix of the \(J\) phenotypic features.
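As a toy illustration of this construction (made-up matrix and weights, not code from the package), one can standardize a small feature matrix with `scale()` and form the weighted sum:

```r
# Toy example: form QDRS_i = sum_j w_j * x_ij from standardized features.
# The data and the weights w are made up purely for illustration.
set.seed(1)
X <- matrix(rnorm(20 * 3), nrow = 20, ncol = 3)  # 20 subjects, 3 features
Xs <- scale(X)                                   # center, scale by sample SD
w <- c(0.5, 0.3, 0.2)                            # hypothetical weights w_j
qdrs <- as.vector(Xs %*% w)                      # one score per subject
```

Each derivation method below differs only in how the weight vector `w` is obtained.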
The phenotype risk score (PheRS) has been recently proposed to combine binary phecodes in an individual (Bastarache et al. 2018), with phecodes’ weights based on the inverse prevalence of the phecode in a given population. Here we use the controls and individuals with unknown status for a given disease, but one can also use an external reference population. Given a dataset of \(N\) subjects, the weight for phenotypic feature \(j\) is calculated as: \[\begin{align*} w_j = \log \frac{N}{n_j}, \end{align*}\] where \(n_j\) is the number of individuals with phenotypic feature \(j\). Hence, less prevalent phenotypic features are given higher weight compared to the more common ones. The rationale behind this weighting scheme is that lower frequency phecodes are more likely to be related to risk of disease in general, an assumption that is likely to be reasonable. However, these weights are not related to the phenotype under consideration.
For subject \(i\), the PheRS for a specific disease is calculated as in Bastarache et al. (2018): \[\begin{align} \text{PheRS}_i = \sum_{j=1}^{J} w_j \mathbb{1}_{\{\text{subject } i \text{ has phenotypic feature }j\}}. \end{align}\]
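A minimal sketch of this weighting on simulated binary phecodes (toy data; the package's `PheRS()` function is shown later):

```r
# PheRS sketch: w_j = log(N / n_j); rarer phecodes receive larger weights.
# The binary matrix P is simulated toy data, not EHR records.
set.seed(2)
P <- matrix(rbinom(100 * 4, 1, prob = c(0.5, 0.3, 0.2, 0.1)),
            nrow = 100, ncol = 4, byrow = TRUE)  # prob recycles per column
N <- nrow(P)
n_j <- colSums(P)                  # number of subjects with each phecode
w <- log(N / n_j)
phers <- as.vector(P %*% w)        # PheRS_i sums weights of carried phecodes
```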
Eigen Score. The Eigen score is a spectral approach based on the correlation matrix \(\mathbf{Q}\) of \(J\) phenotypic features. We have previously proposed this approach in a different context, namely to derive functional scores for genetic variants (Ionita-Laza et al. 2016). Assuming conditional independence of phenotypic features given the true disease status of an individual (either affected or unaffected), the off-diagonal entries of \(\mathbf{Q}\) are identical to those of a rank-one matrix \(\mathbf{R} = \lambda \mathbf{v} \mathbf{v}^\top\) with unit-norm eigenvector \(\mathbf{v}\) and eigenvalue \(\lambda\). Up to a sign ambiguity, the entries of \(\mathbf{v}\) are proportional to the balanced accuracies (the average between the sensitivity and the specificity) of the \(J\) phenotypic features. We construct an estimate \(\hat{\mathbf{R}}\) of the rank-one matrix \(\mathbf{R}\) by solving a system of equations as in (Ionita-Laza et al. 2016), and then we perform eigendecomposition of \(\hat{\mathbf{R}}\) to obtain the leading eigenvector \(\hat{\mathbf{v}}\). The weight of the \(j\)th phenotypic feature is defined as \(\hat{v}_j\), the \(j\)th entry of \(\hat{\mathbf{v}}\). The Eigen score of subject \(i\) is the weighted sum of \(J\) phenotypic features: \[\begin{align*} \text{Eigen}_i = \sum_{j=1}^{J}x_{ij}\hat{v}_j, \end{align*}\] where \(x_{ij}\) is the standardized value of the \(j\)th phenotype for subject \(i\).
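The following is a crude sketch of the rank-one idea. It is not the system-of-equations estimator of Ionita-Laza et al. (2016); it simply keeps the observed off-diagonal correlations fixed and alternates diagonal updates until the matrix is approximately rank one:

```r
# Eigen-score sketch: keep the off-diagonal of Q fixed, iterate the diagonal
# toward rank-one consistency, then take the leading eigenvector as weights.
# Simulated toy data with one shared latent signal.
set.seed(3)
shared <- rnorm(50)
Z <- matrix(rnorm(50 * 4), 50, 4) + shared  # shared signal -> correlated cols
Q <- cor(Z)
R <- Q
for (it in 1:25) {
  e <- eigen(R, symmetric = TRUE)
  diag(R) <- e$values[1] * e$vectors[, 1]^2  # diagonal of lambda * v v^T
}
w_eigen <- eigen(R, symmetric = TRUE)$vectors[, 1]
w_eigen <- w_eigen * sign(sum(w_eigen))      # resolve the sign ambiguity
```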
Principal Component Analysis. Principal component analysis (PCA) is an alternative approach to reduce the dimension of the phenotypic feature space, and identify a small number of principal components (PCs). In particular, we have the spectral decomposition of \(\mathbf{Q}\), \(\mathbf{Q}=\sum_{j=1}^{J}\lambda_j\mathbf{u}_j\mathbf{u}_j^\top\), where \(\lambda_1\geq \cdots \geq \lambda_J>0\) are eigenvalues of \(\mathbf{Q}\) and \(\mathbf{u}_j\) is the \(j\)th eigenvector associated with the \(j\)th largest eigenvalue.
One possible composite score is based on using the entries of the top eigenvector \(\mathbf{u}_1\) (i.e. the loadings) as the weights for the \(J\) phenotypic features. We call this score PC1. Unlike the weights defined by Eigen which are related to the balanced accuracies of the individual features, the weights defined by \(\mathbf{u}_1\) are not guaranteed to reflect prediction accuracies. Furthermore, it is possible that additional principal components are also informative (Aschard et al. 2014; Liu and Lin 2019), and so we consider combining multiple PCs as a linear combination (LPC), especially when including a large number of phenotypic features. The general form of the LPC score for subject \(i\) is \[\begin{align*} \text{LPC}_i = \sum_{k=1}^{K}\beta_k\text{PC}_{ik}, \end{align*}\] where \(K\leq J\) represents the number of PCs being included in the linear combination, and \(\textbf{PC}= \mathbf{X}\mathbf{U}\) is the PC score matrix (\(\mathbf{U}\) is the matrix of eigenvectors). \(K\) can be determined using the Tracy-Widom test (Johnstone and others 2001), a hypothesis test to identify significant eigenvalues of the covariance matrix.
There are several considerations with the LPC method. The first is the choice of sign of individual PCs in the linear combination. We use the training data (see below) to help us identify the sign of each PC. Namely, the sign of each PC is adjusted so that the mean PC value of “cases” is higher than that for “controls”. Although we refer to the LPC approach as unsupervised, we do use some approximate label information to determine the signs of individual PCs in the linear combination. However, we do not need to use gold standard labels, and weakly defined labels are sufficient (e.g. we could define as “cases” those individuals with some lab result present, e.g. the estimated glomerular filtration rate (eGFR) in the case of Chronic Kidney Disease). The second important consideration is the choice of weights. We choose here to use the corresponding eigenvalues \(\lambda\) as weights, so higher weight is assigned to those components with higher amount of variance explained. However the lower PCs (corresponding to smaller eigenvalues) can be as useful for prediction as the top PCs, and hence the approach here may not be optimal. If some amount of labeled data is available, weights can be learned by regression models, such as principal component regression and partial least squares.
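The sign adjustment and eigenvalue weighting described above can be sketched compactly on toy labeled data (the package's `PC()` and `LPC()` functions are demonstrated later):

```r
# PC1/LPC sketch: PCs from prcomp, each sign-flipped so that "cases" have
# the higher mean, then combined with eigenvalues as weights (toy data).
set.seed(4)
n <- 200
grp <- rep(c("Case", "Control"), each = n / 2)
X <- matrix(rnorm(n * 5), n, 5) + (grp == "Case") * 0.8  # mean shift for cases
pr <- prcomp(X, center = TRUE, scale. = TRUE)
K <- 2                                        # e.g. chosen by Tracy-Widom
pcs <- pr$x[, 1:K, drop = FALSE]
for (k in 1:K) {
  if (mean(pcs[grp == "Case", k]) < mean(pcs[grp == "Control", k]))
    pcs[, k] <- -pcs[, k]                     # flip so cases score higher
}
lam <- pr$sdev[1:K]^2                         # eigenvalues as weights
lpc <- as.vector(pcs %*% lam)                 # LPC_i = sum_k lambda_k PC_ik
```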
Generalized linear latent variable models (GLLVMs) are popular tools for modeling multivariate, correlated, mixed outcomes (Sammel, Ryan, and Legler 1997), but a main drawback of this class of models is the computational cost. A recently proposed framework for efficient estimation of GLLVMs, motivated by high-dimensional multivariate abundance data and based on either the Laplace approximation or the variational approximation, reduces the computational burden of fitting GLLVMs to multivariate phenotypic data (Niku et al. 2019). We consider the \(J\) observed phenotypic features \(\mathbf{x}_i=(x_{i1},\ldots,x_{iJ})^\top\) to be the physical manifestation of an unobserved latent variable \(b_i\) that follows a normal distribution \(f(b_i)\). A generalized latent variable model regresses the mean phenotypic features, denoted \(\mu_{ij}\), against the latent variable \(b_i\): \(g(\mu_{ij}) = \beta_{j}+\theta_jb_i\), where \(\beta_{j}\) and \(\theta_j\) are the feature-specific intercept and the slope coefficient for the latent variable, and \(g(\cdot)\) is the corresponding link function; e.g., the probit function links the probability of success for a binary outcome to the latent variable. The posterior estimate of the latent variable \(b_i\) for individual \(i\), which we call the LVS, is then an estimate of severity for a particular phenotype based on the observable phenotypic features. The estimation of the GLLVM uses a variational approximation method, originally developed in machine learning (Wainwright, Jordan, and others 2008), which approximates the posterior distribution \(h(b\vert \mathbf{x}_i)\) by a variational density (e.g. a normal density) chosen to minimize the Kullback-Leibler divergence between the true posterior density and the variational density.
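To make the generative model concrete, the following base-R simulation (not a GLLVM fit; for fitting, the package relies on gllvm) draws binary features from the probit model above and shows that subjects with a larger latent severity \(b_i\) tend to carry more phecodes. All parameter values are made up:

```r
# Simulate J binary features from g(mu_ij) = beta_j + theta_j * b_i with a
# probit link; b_i is the latent severity (all parameters are made up).
set.seed(5)
n <- 500; J <- 6
b <- rnorm(n)                          # latent variable b_i ~ N(0, 1)
beta <- rnorm(J, mean = -1, sd = 0.3)  # feature-specific intercepts
theta <- runif(J, 0.5, 1.5)            # positive loadings on the latent b
eta <- outer(b, theta) + matrix(beta, n, J, byrow = TRUE)
p <- pnorm(eta)                        # probit: mu_ij = Phi(beta_j + theta_j b_i)
Y <- matrix(rbinom(n * J, 1, p), n, J)
cor(rowSums(Y), b)                     # phecode burden tracks latent severity
```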
The main functions of the QDRS package compute the QDRS based on the methods above, with the arguments listed below (see examples in Section Analysis of eMERGE Data > Deriving QDRS):
PheRS(X, feature.prevalence = NULL, group = NULL)
eigen.score(X, training, scale = TRUE)
PC(X, group, training, scale = TRUE, pc.num)
LPC(X, group, training, scale = TRUE)
LVS.score(X, Y = NULL, family = "binomial",
starting.choice = "random", p.seed = 3080)
The argument X in the functions PheRS(), eigen.score(), PC(), and LPC() represents the input feature matrix. The PheRS weights can be specified directly through the argument feature.prevalence, derived from the controls identified by the argument group, or simply calculated using all subjects if neither of these two arguments is provided. The argument training is an index or logical vector that indicates whether a subject is used for training.
For the LVS.score() function, the training and test sets are specified through the arguments X and Y, respectively. The argument family, the distribution assumed for the features, defaults to "binomial". The available distributions, with their mean \(E(y_{ij})\), mean-variance \(V(\mu_{ij})\), and link functions for the various response types, are listed in the table below. The mean and mean-variance are estimated by variational approximation methods.
| Response | Distribution | Link | Description |
|---|---|---|---|
| Counts | Poisson | log | \(E(y_{ij}) = \mu_{ij}\), \(V(\mu_{ij})=\mu_{ij}\) |
| Counts | Negative binomial | log | \(E(y_{ij}) = \mu_{ij}\), \(V(\mu_{ij})=\mu_{ij}+\phi_j\mu_{ij}^2\), where \(\phi_j>0\) is a dispersion parameter |
| Binary | Bernoulli | probit | \(E(y_{ij}) = \mu_{ij}\), \(V(\mu_{ij}) = \mu_{ij}(1-\mu_{ij})\) |
| Ordinal | Multinomial | probit | Cumulative probit model |
| Normal | Gaussian | identity | \(E(y_{ij}) = \mu_{ij}\), \(V(y_{ij}) = \phi_j^2\) |
| Positive continuous | Gamma | log | \(E(y_{ij}) = \mu_{ij}\), \(V(y_{ij}) = \mu_{ij}^2/\phi_j\), where \(\phi_j\) is a shape parameter |
| Non-negative continuous | Exponential | log | \(E(y_{ij}) = \mu_{ij}\), \(V(y_{ij}) = \mu_{ij}^2\) |
The proposed methods have been applied to six diseases: Chronic Kidney Disease (CKD), Coronary Artery Disease (CAD), Type 2 Diabetes (T2D), Heart Failure (HF), Dementia, and Gastroesophageal Reflux Disease (GERD). Here we focus on CKD for illustration purposes. This section follows the flow chart in Figure 1 to show the analysis of the eMERGE data sets as an example. The description of the eMERGE-seq panel genetic data is deferred to Section Genetic Association Studies.
knitr::include_graphics("fig/Analysis_Flow_Chart.jpg")
Figure 1: Analysis Flow Chart of eMERGE Application.
The Chronic Kidney Disease (CKD) cohort consists of 98,486 subjects (102,597 individuals with phecodes in eMERGE, minus 4,111 End Stage Renal Disease (ESRD) patients), with demographic information such as gender, race, and age, and 1,817 phecodes divided into 18 categories (1,866 phecodes minus 49 zero-prevalence phecodes; Table 1).
load("data/phecode_category_counts.rda")
library(rmarkdown)
library(kableExtra)
rownames(order.tb) = NULL
kable(order.tb, caption = "**The number of phecodes with non-zero prevalence in the eMERGE dataset for each category.**", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),fixed_thead = T)
| Category | Counts |
|---|---|
| Genitourinary | 175 |
| Circulatory system | 172 |
| Endocrine/metabolic | 169 |
| Digestive | 166 |
| Neoplasms | 141 |
| Musculoskeletal | 132 |
| Sense organs | 128 |
| Injuries & poisonings | 126 |
| Dermatologic | 96 |
| Respiratory | 85 |
| Neurological | 84 |
| Mental disorders | 76 |
| Hematopoietic | 62 |
| Pregnancy complications | 58 |
| Congenital anomalies | 56 |
| Symptoms | 50 |
| Infectious diseases | 23 |
| NULL | 18 |
Each individual in the eMERGE network has raw longitudinal records of ICD-9 and ICD-10 codes, which can be mapped to phecodes (Denny et al. 2013); the presence of a particular phecode is defined by at least two occurrences of the corresponding ICD-9 or ICD-10 codes in an individual's health records. Each phecode is used as a proxy for the corresponding condition. Note that the absence of a phecode may mean either that the condition was never assessed or that a healthy status was simply not recorded.
The phecode history of the CKD cohort is reformatted into wide format, i.e., each row represents a subject, each column represents a phecode, and the entries indicate the presence of each phecode. The feature matrix of the 98,486 individuals is centered and scaled before splitting into training and test sets for the Eigen and PC-based approaches. The phecode features are kept on the original scale for the PheRS and LVS derivations.
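The reformatting step can be sketched with a few toy records (hypothetical IDs and phecodes; the real pipeline maps ICD codes to phecodes first):

```r
# Long-to-wide sketch: a phecode is marked present only when it occurs at
# least twice for a subject, mirroring the two-occurrence rule above.
long <- data.frame(
  id      = c("A", "A", "A", "B", "B", "C"),
  phecode = c("585.3", "585.3", "401.1", "401.1", "401.1", "585.3")
)
counts <- table(long$id, long$phecode)  # occurrences per subject x phecode
wide <- 1 * (counts >= 2)               # binary presence matrix
wide["A", "585.3"]                      # subject A: two occurrences -> present
```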
load("data/example_feature_matrix.rda")
kable(example_unscaled,caption = "**Example CKD Dataset.**", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover",
"condensed", "responsive"),fixed_thead = T)
| Race | Group | G_Staging | Min_Age | Max_Age | code_585.3 | code_433.1 | code_172.1 |
|---|---|---|---|---|---|---|---|
| C41261 | Case | G2 | 54 | 71 | 0 | 0 | 0 |
| C16352 | Case | G2 | 34 | 43 | 1 | 0 | 0 |
| C41261 | Control | Control | 44 | 62 | 0 | 0 | 0 |
| C41261 | Case | G1 | 60 | 78 | 0 | 0 | 1 |
| C41261 | Control | Control | 44 | 59 | 0 | 0 | 0 |
| C41261 | Case | G3a | 73 | 89 | 0 | 0 | 1 |
We took advantage of a CKD phenotyping algorithm recently developed within eMERGE to diagnose and place individuals on a CKD staging grid by estimated glomerular filtration rate (eGFR) (al. 2019) (eGFR slope is a commonly used indicator of global kidney function). The CKD stages include Control, G1, G2, G3a, G3b, and G4. We divided the case-control dataset in Table 3 into two parts, and used 50% of the cases and 50% of controls for training purposes. We used 104 feature phecodes to build the quantitative disease risk scores, the choice of which is explained in the next section.
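The 50/50 split within cases and controls can be sketched as follows (toy labels, not the original sampling code):

```r
# Stratified 50% split: sample half of each group for training.
set.seed(6)
grp <- c(rep("Case", 10), rep("Control", 14))          # toy labels
train.idx <- unlist(lapply(split(seq_along(grp), grp),
                           function(ix) sample(ix, length(ix) %/% 2)))
is.train <- seq_along(grp) %in% train.idx
table(grp, is.train)   # half of each group lands in the training set
```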
For consistency, we focused on the same set of 98,486 subjects as above in the analyses of additional phenotypes, including Coronary Artery Disease (CAD), Type 2 Diabetes (T2D), Heart Failure (HF), Dementia, and Gastroesophageal Reflux Disease (GERD). CAD case definition was based on a composite of myocardial infarction (Khera et al. 2018). Myocardial infarction was based on self-report or hospital admission diagnosis. This included individuals with ICD-9 codes of 410.X, 411.0, 412.X, or 429.79, or ICD-10 codes of I21.X, I22.X, I23.X, I24.1, or I25.2 in hospitalization records. The case/control definitions of T2D, HF, Dementia and GERD are based on the validated algorithms available at the Phenotype KnowledgeBase (PheKB) (Denny and University 2012; Clinic 2013; Cooperative 2012; CHOP Phenotyping group 2014). We used 50% of the cases and an equal number of controls for training, and the rest of the cases and controls as test set for performance evaluation. The number of cases and controls for each phenotype are listed in Table 3. Those individuals who are neither case nor control for a given phenotype are treated as having unknown status.
load("data/phenotype_counts.rda")
rownames(phenotype_counts) = NULL
kable(phenotype_counts, caption = "**Number of individuals (cases vs. controls) in training and test datasets.**", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),fixed_thead = T)
| Phenotype | Feature Phecodes | Training Control | Training Case | Test Control | Test Case |
|---|---|---|---|---|---|
| CKD | 104 | 7334 | 17897 | 7334 | 17898 |
| CAD | 93 | 5479 | 5479 | 21916 | 5479 |
| T2D | 153 | 3767 | 3767 | 9960 | 3768 |
| HF | 90 | 1724 | 1724 | 10806 | 1724 |
| Dementia | 143 | 879 | 879 | 7570 | 879 |
| GERD | 162 | 3233 | 3233 | 5918 | 3233 |
kable(ckd.cases, booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),fixed_thead = T)
| CKD.Cases | G1 | G2 | G3a | G3b | G4 |
|---|---|---|---|---|---|
| 39602 | 2370 | 25411 | 7055 | 3462 | 1304 |
We focused on including as phenotypic features those phecodes that we deemed possibly relevant to the disease under consideration. In order to be able to validate the resulting quantitative disease risk scores, we excluded the case defining phecodes (Table 4). Note that this was done only for evaluation purposes; in the final scores we did include such phecodes. We included 104 CKD phecodes (manually selected by experts), 93 CAD phecodes (among the ‘circulatory system’ category, those with p-values <10^{-5} in logistic regression of each phecode against CAD polygenic risk scores), 132 T2D phecodes (manually selected by experts among the significant phecodes with p-values <10^{-5} in logistic regression of each phecode against T2D polygenic risk scores), 90 HF phecodes (CAD feature phecodes with HF case defining phecodes removed), 143 Dementia phecodes (from the ‘mental disorders’ and ‘neurological’ categories), and 162 GERD phecodes (from the ‘digestive’ category).
library(kableExtra)
load("data/Case_defining_phecodes.rda")
kable(Case_defining_phecodes, caption = "**Phecodes that are used to define cases for phenotypes (at least one occurrence) in eMERGE dataset.**", booktabs = TRUE) %>% collapse_rows(columns = 1, latex_hline = "major", valign = "middle") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),fixed_thead = T)
| Phenotype | Phecode | Description | Category |
|---|---|---|---|
| CKD | 585.3 | Chronic renal failure [CKD] | genitourinary |
| | 585.31 | Renal dialysis | genitourinary |
| | 585.32 | End stage renal disease | genitourinary |
| | 585.33 | Chronic Kidney Disease, Stage III | genitourinary |
| | 585.34 | Chronic Kidney Disease, Stage IV | genitourinary |
| | 585.4 | Chronic kidney disease, Stage I or II | genitourinary |
| CAD | 411.2 | Myocardial infarction | circulatory system |
| | 414 | Other forms of chronic heart disease | circulatory system |
| T2D | 250.2 | Type 2 diabetes | endocrine/metabolic |
| | 250.21 | Type 2 diabetes with ketoacidosis | endocrine/metabolic |
| | 250.22 | Type 2 diabetes with renal manifestations | endocrine/metabolic |
| | 250.23 | Type 2 diabetes with ophthalmic manifestations | endocrine/metabolic |
| | 250.24 | Type 2 diabetes with neurological manifestations | endocrine/metabolic |
| | 250.25 | Diabetes type 2 with peripheral circulatory disorders | endocrine/metabolic |
| HF | 428 | Congestive heart failure; nonhypertensive | circulatory system |
| | 428.1 | Congestive heart failure (CHF) NOS | circulatory system |
| | 428.2 | Heart failure NOS | circulatory system |
| | 428.3 | Heart failure with reduced EF [Systolic or combined heart failure] | circulatory system |
| | 428.4 | Heart failure with preserved EF [Diastolic heart failure] | circulatory system |
| Dementia | 290 | Delirium dementia and amnestic and other cognitive disorders | mental disorders |
| | 290.1 | Dementias | mental disorders |
| | 290.11 | Alzheimer’s disease | mental disorders |
| | 290.12 | Dementia with cerebral degenerations | mental disorders |
| | 290.13 | Senile dementia | mental disorders |
| | 290.16 | Vascular dementia | mental disorders |
| | 290.2 | Delirium due to conditions classified elsewhere | mental disorders |
| | 317.1 | Alcoholism | mental disorders |
| | 292.2 | Mild cognitive impairment | mental disorders |
| | 317 | Alcohol-related disorders | mental disorders |
| | 290.3 | Other persistent mental disorders due to conditions classified elsewhere | mental disorders |
| GERD | 530.1 | Esophagitis, GERD and related diseases | digestive |
| | 530.11 | GERD | digestive |
| | 530.14 | Reflux esophagitis | digestive |
This section provides CKD as an example of deriving quantitative disease risk scores. The details of this data set and selection of CKD ‘relevant’ phecodes are described in Section Data Inputs and Pre-processing.
# load the CKD training set
# two phecode matrices
load("data/ckd_phedat_training.rda")
dim(ckd_phedat_training)
#[1] 25231 1874
dim(ckd_phedat_training_unscaled)
#[1] 25231 1874
load("data/ckd_phedat_test.rda")
dim(ckd_phedat_test)
#[1] 25232 1874
dim(ckd_phedat_test_unscaled)
#[1] 25232 1874
# load the list of relevant phecodes including the case defining phecodes
load("data/ckd_phecodes_list.rda")
length(ckd_phecodes)
#[1] 104
length(ckd_label_phecodes)
#[1] 6
PheRS. For each phenotype, the prevalence of each phecode was estimated in the set of non-cases (controls plus individuals with unknown status), and the PheRS weights were then computed as the log of the inverse prevalence. The following code calculates the phecode prevalence for CKD based on the controls plus the individuals with unknown status:
# load the phecode matrix
load("/Volumes/Seagate Bac/Multiple Phenotypes/common/phecode_dat_eMERGE.rda")
phecode_mat = as.data.frame(Phecode_matrix[!duplicated(Phecode_matrix[,1]),])
# load the algorithm defined CKD labels
load("/Volumes/Seagate Bac/Multiple Phenotypes/common/eMERGE_ckd_labels.rda")
# merge the phecode matrix with label information matrix
phecode_mat = merge(eMERGE_ckd_labels, phecode_mat, by.x="eMERGE_ID", by.y="ID", sort=F)
phecode_mat_ckd_health = phecode_mat[phecode_mat$Group=="Control",]
prevalence.ckd = colMeans(phecode_mat_ckd_health[,-c(1:7)])
# reformat as data frame
prevalence.ckd = matrix(prevalence.ckd,nr=1)
colnames(prevalence.ckd) = colnames(phecode_mat_ckd_health[,-c(1:7)])
load("data/phecode.prevalence.rda")
subexample = round(phecode.prevalence[,c("prevalence.ckd","prevalence.cad","prevalence.T2D","prevalence.hf","prevalence.Dementia","prevalence.gerd")],3)
colnames(subexample) <- c("CKD", "CAD", "T2D", "HF", "Dementia", "GERD")
temp <- data.frame(code = phecode.prevalence$code, subexample)
rownames(temp) = NULL
kable(tail(temp), caption = "**Example: Prevalence of Phecodes Listed by Phenotypes.**", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),fixed_thead = T)
| | code | CKD | CAD | T2D | HF | Dementia | GERD |
|---|---|---|---|---|---|---|---|
| 1861 | code_990 | 0.021 | 0.025 | 0.027 | 0.027 | 0.027 | 0.027 |
| 1862 | code_994 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1863 | code_994.1 | 0.014 | 0.010 | 0.012 | 0.013 | 0.013 | 0.012 |
| 1864 | code_994.2 | 0.058 | 0.041 | 0.049 | 0.049 | 0.054 | 0.052 |
| 1865 | code_994.21 | 0.023 | 0.014 | 0.018 | 0.018 | 0.019 | 0.019 |
| 1866 | code_996 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Here we derive the CKD PheRS weights from the training set based on the prevalence computed above, then use the predictQDRS() function to predict the PheRS for the test set.
load("data/prevalence.ckd.rda")
PheRS.res = PheRS(X = ckd_phedat_training_unscaled[,ckd_phecodes],
feature.prevalence = as.numeric(prevalence.ckd[,ckd_phecodes]))
PheRS.test = predictQDRS(Y = ckd_phedat_test_unscaled[,ckd_phecodes],
weights = PheRS.res$weights)
ckd.QDRS.test = data.frame(Group = ckd_phedat_test$Group,
G_Staging = ckd_phedat_test$G_Staging,
PheRS = as.vector(PheRS.test))
Eigen. The calculation of the Eigen score has two steps: (1) estimate the rank-one matrix \(\mathbf{R}\) by solving the system of linear equations described in Section Methods, and (2) perform the eigendecomposition of the estimate to obtain the weights. For the matrix factorization approaches, if the data set is not large, one can set the argument X to the entire original data set, training to the index/indicator vector for the training subset, and scale to TRUE; the function will then first scale the data set, use the specified subset to train the weights, and compute the scores for all subjects.
phecode_m = ckd_phedat_training[,ckd_phecodes]
# this is the scaled training set (pre-processed the union of training and test set)
# so set training to all row numbers
# then the function returns weights and scores
Eigen.res = eigen.score(X = phecode_m,
training = 1:nrow(phecode_m),
scale = FALSE)
Eigen.test = predictQDRS(Y = ckd_phedat_test[,ckd_phecodes],
weights = Eigen.res$weights)
ckd.QDRS.test = data.frame(ckd.QDRS.test,
Eigen = as.vector(Eigen.test))
Individual PC. The sign of a principal component is arbitrary and may depend on the computation platform. Hence, the individual-PC step uses the existing algorithm-defined case-control status to select the sign, i.e., the sign is chosen so that the cases have a higher mean score than the controls in the training set.
PC.res = PC(X = phecode_m,
group = ckd_phedat_training$Group,
training = 1:nrow(phecode_m),
scale = FALSE,
pc.num = 1:2)
PC12.test = as.data.frame(predictQDRS(Y = ckd_phedat_test[,ckd_phecodes],
weights = PC.res$weights))
ckd.QDRS.test = data.frame(ckd.QDRS.test,
PC12.test)
LPC. The LPC score is a weighted linear combination of the multiple PCs mentioned above, where the weights are the corresponding eigenvalues. The number of significant eigenvalues is determined by the Tracy-Widom test, which suggested 9, 12, 14, 11, 8, and 20 significant eigenvalues (PCs) of the covariance matrix of feature phecodes for CKD, CAD, T2D, HF, Dementia, and GERD, respectively.
LPC.res = LPC(X = phecode_m,
group = ckd_phedat_training$Group,
training = 1:nrow(phecode_m),
scale = FALSE)
LPC.test = as.data.frame(predictQDRS(Y = ckd_phedat_test[,ckd_phecodes],
weights = LPC.res$weights))
ckd.QDRS.test = data.frame(ckd.QDRS.test,
LPC.test)
colnames(ckd.QDRS.test)[3:7] = c("PheRS", "Eigen",
"PC1","PC2","LPC")
LVS. The training step fits a generalized linear latent variable model (GLLVM) with one latent variable, using the variational approximation method for multivariate Bernoulli data. The next step is to predict the latent variable at new observations. Because of the computational burden, we used a subset of the CKD training set (Control: 2,444, G1: 395, G2: 3,701, G3a: 1,102, G3b: 555, and G4: 213) to train the latent variable model. Note that if the starting point is random (the default), the derived scores may differ slightly across runs.
load("data/ckd_phedat_training_lvs.rda")
dim(ckd_phedat_training_unscaled)
#[1] 8410 1874
# this computation may take a long time
#LVS.test = LVS.score(X = ckd_phedat_training_unscaled[,ckd_phecodes],
#                     Y = ckd_phedat_test_unscaled[,ckd_phecodes],
#                     p.seed = 2032)
# Recommend: compute LVS on server
# fit the model first
# then predict the values in subsets in parallel
LVS.model = LVS.fit(X = ckd_phedat_training_unscaled[,ckd_phecodes],
family = "binomial",
p.seed = 2032)
# predict subset, e.g., 50 in a subset
LVS.test.sub = predictLVS(ckd_phedat_test_unscaled[1:50,ckd_phecodes],
LVS.model)

We compared the different scores in terms of the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) in the test datasets. We observed that LPC exhibits a larger AUROC than PheRS: 0.779 vs. 0.76 for CKD, 0.927 vs. 0.917 for CAD, 0.764 vs. 0.739 for T2D, 0.922 vs. 0.898 for HF, and 0.779 vs. 0.684 for Dementia, and the second-largest AUROC overall for GERD, still above PheRS (0.809 vs. 0.805) (Figure 2). LPC also has the largest AUPRC for all studied phenotypes (Figure 3). PC2 has weak performance in general, with AUROCs of 0.569 for CKD, 0.508 for CAD, 0.541 for T2D, 0.603 for HF, and 0.634 for GERD.
knitr::include_graphics("fig/roc_curves_6_phenotypes.jpg")
Figure 2: ROC curves cases vs. controls for six phenotypes.
knitr::include_graphics("fig/pr_curves_6_phenotypes.jpg")
Figure 3: PR curves cases vs. controls for six phenotypes.
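For reference, AUROC can be computed without extra packages via its rank-statistic (Mann-Whitney/Wilcoxon) form; the helper below is a generic sketch on toy scores, not the code used for the figures:

```r
# AUROC via the Mann-Whitney U statistic: the probability that a random case
# scores higher than a random control (ties get credit 1/2 through ranks).
auroc <- function(score, is.case) {
  r  <- rank(score)
  n1 <- sum(is.case)
  n0 <- sum(!is.case)
  (sum(r[is.case]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
set.seed(7)
s   <- c(rnorm(100, mean = 1), rnorm(100, mean = 0))  # cases shifted upward
lab <- rep(c(TRUE, FALSE), each = 100)
auroc(s, lab)   # close to pnorm(1 / sqrt(2)) for this toy setting
```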
Similar results were obtained when comparing controls with cases at different CKD G stages (Figures 4 and 5). Note that the AUROC values are much higher when contrasting controls with more advanced stages of CKD (G3b: 0.907 for LPC; G4: 0.933 for LPC).
knitr::include_graphics("fig/ckd_roc_curves_analysis.jpg")
Figure 4: ROC curves for CKD Cases vs. Controls, and CKD cases at different G stages vs. controls.
knitr::include_graphics("fig/ckd_pr_curves_analysis.jpg")
Figure 5: PR curves for CKD Cases vs. Controls, and CKD cases at different G stages vs. controls.
R functions to compare performance. Figures 2-5 were generated by code written specifically for this application. The functions multiple.roc.curve() and multiple.pr.curve() can generate similar plots for a given grouping contrast, e.g. Case vs. Control. The argument score.mat is the score matrix with subjects as rows and the different scores as columns, and score.names is the vector of column names to compare.
multiple.roc.curve(
disease,
output.date = NULL,
score.mat,
score.names,
group,
group.levels,
legend.pos = "bottomright",
pairs.sub = NULL
)
multiple.pr.curve(
disease,
output.date = NULL,
score.mat,
score.names,
group,
group.levels,
legend.pos = "bottomright",
pairs.sub = NULL
)
We now consider the impact of adding many more phecodes as possible phenotypic features to the computation of the quantitative disease risk scores. We randomly selected phecodes among the 1,817 available phecodes and assessed the effect of including them on the accuracy of the resulting scores. We added phecodes sequentially and evaluated the performance of the different methods. The AUROC values for the different methods as a function of the number of random phecodes added are shown in Figure 6. With the exception of PC2, the quantitative disease risk scores tend to be quite robust to the addition of new phecodes, with the LPC method even showing some improvement for CKD, suggesting that, as more phecodes are added, additional PCs can capture more information. See Figure 7 for a similar pattern in the AUPRC values.
knitr::include_graphics("fig/roc_added_phecodes.jpg")
Figure 6: AUROC for cases vs. controls with increasing number of phecodes.
knitr::include_graphics("fig/pr_added_phecodes.jpg")
Figure 7: AUPRC for cases vs. controls with increasing number of phecodes.
Although this robustness may seem surprising, it is important to note that the training datasets are different for the different diseases (each training dataset is enriched in cases for the disease under consideration), which leads to different correlation patterns among the phecodes. Indeed, we show that the weights for the ‘relevant’ phecodes are significantly higher than those of the additional phecodes for the proposed quantitative disease risk scores (Table 6). Although PheRS does surprisingly well with the addition of new phecodes, the weights derived by the PheRS are not significantly different between the ‘relevant’ phecodes used originally and the newly added ones. The PheRS weights are negatively correlated with the LPC weights (Figure 8).
load("data/wilcox_added_phecodes.rda")
library(kableExtra)
kable(wilcox.added, caption = "**Weights for 'relevant' phecodes vs. additional phecodes.** The 'relevant' phecodes include the case defining phecodes. Wilcoxon rank-sum test one-sided p-values comparing weights for 'relevant' phecodes vs. additional phecodes are reported (with alternative hypothesis: selected phecodes have greater weights).", booktabs = TRUE) %>% collapse_rows(columns = 1, latex_hline = "major", valign = "middle") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),fixed_thead = T)
| Phenotype | added | PheRS | Eigen | PC1 | PC2 | LPC |
|---|---|---|---|---|---|---|
| CKD | 200 | 1.00E+00 | 4.78E-10 | 8.34E-18 | 5.05E-04 | 1.33E-03 |
| | 1200 | 9.98E-01 | 2.31E-04 | 1.40E-06 | 5.74E-09 | 3.11E-22 |
| CAD | 200 | 9.99E-01 | 2.55E-18 | 3.30E-26 | 4.41E-02 | 9.86E-20 |
| | 1200 | 9.99E-01 | 2.26E-10 | 3.10E-16 | 1.36E-24 | 3.75E-34 |
| T2D | 200 | 1.00E+00 | 2.18E-21 | 5.14E-27 | 1.98E-05 | 5.80E-29 |
| | 1200 | 1.00E+00 | NA | 3.11E-17 | 1.78E-22 | 7.97E-36 |
| HF | 200 | 1.00E+00 | 1.52E-16 | 1.17E-23 | 1.17E-02 | 7.19E-11 |
| | 1200 | 1.00E+00 | 4.24E-12 | 3.95E-13 | 2.73E-21 | 1.78E-21 |
| Dementia | 200 | 6.66E-01 | 4.36E-02 | 1.61E-02 | 1.09E-05 | 5.25E-06 |
| | 1200 | 9.69E-01 | 2.97E-02 | 1.94E-01 | 1.00E+00 | 2.61E-02 |
| GERD | 200 | 4.93E-01 | 3.85E-03 | 6.82E-04 | 4.66E-07 | 1.57E-10 |
| | 1200 | 8.53E-01 | 5.90E-02 | 2.33E-02 | 1.00E+00 | 1.69E-06 |
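The comparison reported in Table 6 amounts to a one-sided Wilcoxon rank-sum test on the two sets of weights. A minimal sketch follows; the vectors `w.relevant` and `w.added` are simulated stand-ins, not the actual phecode weights.

```r
# One-sided Wilcoxon rank-sum test: do 'relevant' phecodes receive larger
# weights than randomly added ones? Weights here are simulated stand-ins.
set.seed(42)
w.relevant <- abs(rnorm(110, mean = 0.15, sd = 0.05))
w.added    <- abs(rnorm(200, mean = 0.02, sd = 0.02))
wt <- wilcox.test(w.relevant, w.added, alternative = "greater")
wt$p.value  # a small p-value supports larger weights for relevant phecodes
```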
knitr::include_graphics("fig/prevalence_plot_PheRS_LPC_1200_all_phecodes.jpg")
Figure 8: Weights for phecodes used to build PheRS and LPC for 6 phenotypes.
The derived quantitative disease risk scores correlate very well with CKD staging, providing support for the use of these scores as measures of disease severity (Figure 9). Note that the distributions of risk scores for the CKD Control and G1 stages are similar, since G1 is defined as individuals who have normal renal function but other abnormalities that lead to their classification as CKD.
knitr::include_graphics("fig/ckd_G_boxplot_analysis_110_phecodes_lvs_random.jpg")
Figure 9: Quantitative disease risk scores vs. CKD G-staging.
Furthermore, we investigated the prevalence of cases in the test set among individuals at different percentiles of the LPC risk score. As expected, the proportion of cases increases among individuals with higher LPC quantitative disease risk scores (Figures 10 and 11). We noted that the prevalence of Dementia cases among individuals with a high quantitative disease risk score for Dementia is much lower (30%–53.6%) than for the other diseases (CKD: 97.2%–100%, CAD: 81.4%–90.9%, T2D: 69.3%–98.5%, HF: 74.4%–96.8%, GERD: 88%–96.7%), which suggests that Dementia is a more difficult disease to diagnose. The other risk scores (PheRS, Eigen, PC1 and LVS) show similar overall patterns.
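The prevalence panels in Figures 10 and 11 can be reproduced in outline by binning subjects on score percentiles and computing the case fraction per bin. A minimal sketch on simulated scores and labels (not the actual eMERGE data):

```r
# Case prevalence across 60 percentile bins of a risk score,
# using simulated data in which risk increases with the score.
set.seed(7)
n <- 6000
score <- rnorm(n)
case  <- rbinom(n, 1, plogis(2 * score - 2))  # risk rises with score
pct   <- cut(rank(score) / n, breaks = seq(0, 1, length.out = 61),
             include.lowest = TRUE)           # 60 percentile bins
prev  <- tapply(case, pct, mean)              # prevalence per bin
```

With a score that genuinely tracks risk, the prevalence climbs from the lowest to the highest percentile bin, which is the pattern shown in the figures.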
We have repeated the analyses above with a less stringent control definition, namely including, in addition to the controls defined by the algorithm, individuals with unknown status. For CKD, we noticed an interesting pattern: at the most extreme values of the LPC quantitative disease risk score, there was a sudden decrease in prevalence, suggesting that some individuals with high quantitative disease risk scores have unknown case status (Figure 10). This emphasizes both the difficulty of obtaining accurate phenotypic labels and the potential of the quantitative disease risk scores to identify undiagnosed cases.
knitr::include_graphics("fig/LPC_prev_plots_CKD_rest_all.jpg")
Figure 10: Distribution of CKD LPC risk scores in the test set vs. test set + individuals with unknown status.
knitr::include_graphics("fig/LPC_prev_plots_all_phecodes.jpg")
Figure 11: Distribution of final LPC scores for six phenotypes in the test set. LPC risk scores are derived based on features that include case defining phecodes. For each phenotype: Left, distribution of LPC risk scores in the test set. Middle, LPC risk score percentiles among cases vs. controls. Right, case prevalence in 60 bins according to the percentiles of LPC risk scores.
The R function dist.plot() plots a density plot, a boxplot and a prevalence plot for quantitative disease risk scores when the grouping is ‘Case vs. Control’ or ‘Case vs. Control vs. Unknowns’.
dist.plot(disease = "CKD",
score.mat = ckd.QDRS.test,
score.name = "LPC",
group = ckd.QDRS.test$Group,
# FALSE for removing the unknowns
unknown.show = FALSE
# TRUE for showing the unknowns
#unknown.show = TRUE
)
The eMERGE Network created an eMERGE-specific sequencing platform and sequenced a cohort of 25,000 participants (including 14,813 Non-Hispanic White (NHW), 3,110 African American, 1,437 Asian and 1,301 Hispanic) with an eMERGE-seq panel covering 109 actionable genes (Consortium and others 2019). The 109 genes include 56 genes from the American College of Medical Genetics and Genomics (ACMG) published recommendations for actionable findings, plus additional genes deemed potentially actionable that were selected across all eMERGE sites (Green et al. 2013). Low-quality variant calls were filtered out based on GATK recommendations, resulting in 57,398 variants. Among the 25,000 participants, 21,363 individuals have both sequencing data and quantitative disease risk scores. More details on the quality control steps are given in the Materials and Methods section.
We have performed comprehensive association tests with both rare and common variants within each individual gene (Lee et al. 2012; He et al. 2017). For each gene, we have combined several tests, as follows:
1. Burden and dispersion tests for common and low-frequency variants (MAF \(\ge\) 0.01) with Beta(MAF, 1, 25) weights, where Beta(\(\cdot\)) is the probability density function of the beta distribution with shape parameters 1 and 25.
2. Burden and dispersion tests for rare variants (MAF < 0.01 and minor allele count (MAC) \(\ge\) 5) with Beta(MAF, 1, 25) weights.
3. Burden and dispersion tests for rare variants, weighted by functional annotations (CADD, PolyPhen-2).
4. Burden test for the aggregation of ultra-rare variants with MAC \(<\) 5 (e.g., singletons, doubletons).
5. Single-variant score tests for common, low-frequency and rare variants in the gene.
We then applied the aggregated Cauchy association test (Liu and Xie 2020) to combine the p values from tests 1–5 into a final gene-level p value. We adjusted for age, gender, and 10 principal components of genetic variation. Because the distribution of the LPC scores is right-skewed (Figure 11), we assumed a generalized linear model (GLM) based on the inverse-Gaussian distribution.
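The Cauchy combination step can be sketched in a few lines. The function below is a minimal equal-weight version of the aggregated Cauchy association test (Liu and Xie 2020), not the implementation used in the analysis:

```r
# Equal-weight Cauchy combination (ACAT) of a vector of p-values:
# transform each p to a Cauchy statistic, average, back-transform.
acat <- function(p) {
  stat <- mean(tan((0.5 - p) * pi))
  0.5 - atan(stat) / pi
}
acat(c(0.01, 0.20, 0.50))  # combined p-value, roughly 0.03
```

A useful sanity check: identical p values combine to the same p value, and a single very small p value dominates the average through the heavy Cauchy tail.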
Here is an example of identifying a significant gene based on the comprehensive association tests described above.
Import packages. Install required packages if missing.
# install required packages
pkgs = c("dplyr", "Matrix", "SPAtest", "CompQuadForm", "SKAT", "knitr")
pkgs.na = pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(pkgs.na) > 0) {
install.packages(pkgs.na, dependencies = TRUE)
}
Import packages, scripts, and datasets including phenotypes (Y), covariates (X), genotypes (G), and functional annotations (CADD, PolyPhen-2).
Null model. Fit the null model for the regression. Dataset: phenotypes (Y), covariates (X), and individual identifiers (id).
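As an illustration of this step, the following fits an inverse-Gaussian null model of the (positive, right-skewed) score on covariates alone. The data are simulated, the names Y and X follow the text, and the log link is an assumption:

```r
# Null model sketch: regress a positive, right-skewed score on covariates
# only, assuming an inverse-Gaussian GLM with log link (simulated data).
set.seed(1)
n <- 500
X <- data.frame(age = rnorm(n, 55, 10), sex = rbinom(n, 1, 0.5))
Y <- exp(0.01 * X$age + 0.2 * X$sex + rnorm(n, 0, 0.5))
null.fit <- glm(Y ~ age + sex, data = X,
                family = inverse.gaussian(link = "log"))
coef(null.fit)  # intercept, age, and sex effects under the null model
```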
Gene test. The genotype matrix (G) includes genotype calls for all sites within a given gene (region). Generate variant weights based on MAC, MAF, CADD, and PolyPhen-2.
# minor allele frequency and minor allele count
MAF = colMeans(G)/2
MAC = colSums(G)
MAF.threshold = 0.01
# analysis window: a single region covering all variants in the gene
window.matrix = Matrix(matrix(data = 1, nrow = ncol(G), ncol = 1))
# zero out the window if the gene has too few minor alleles or variants
if (sum(MAC) < 5 | ncol(G) < 2) {
window.matrix = window.matrix * 0
}
# weight using MAF: Beta(1, 25) density, with ultra-rare variants (MAC < 5) aggregated separately
weight.matrix = cbind(MAC<5,
(MAF<MAF.threshold&MAC>=5)*dbeta(MAF,1,25),
(MAF>=MAF.threshold)*dbeta(MAF,1,25))
colnames(weight.matrix) = c("MAC<5",
paste0("MAF<", MAF.threshold, "&MAC>=5&Beta"),
paste0("MAF>=", MAF.threshold ,"&Beta"))
# weight using CADD
colnames(CADD) = paste0("MAF<",
MAF.threshold, "&MAC>=5&CADD")
weight.matrix = cbind(weight.matrix,
(MAF<MAF.threshold&MAC>=5)*CADD)
# weight using PolyPhen-2
colnames(polyphen) = paste0("MAF<", MAF.threshold, "&MAC>=5&Polyphen")
weight.matrix = cbind(weight.matrix,
(MAF<MAF.threshold&MAC>=5)*polyphen)
# all weights
weight.matrix = Matrix(weight.matrix)
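The Beta(1, 25) density used in the weights above decreases monotonically on (0, 1), so rarer variants receive larger weights; a quick check:

```r
# dbeta(x, 1, 25) = 25 * (1 - x)^24: larger for smaller MAF
dbeta(0.001, 1, 25)  # ~24.4
dbeta(0.010, 1, 25)  # ~19.6
dbeta(0.100, 1, 25)  # ~2.0
```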
Run burden tests and SKAT tests integrating different weights. Different tests are aggregated using Cauchy combination.
fit = KS_test(G, MAF, MAC, result.prelim, window.matrix, weight.matrix)
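For intuition about what a single weighted burden test inside KS_test() does, here is a self-contained sketch on simulated genotypes. This is an illustration only, not the package implementation (which uses score tests based on the null-model fit):

```r
# Weighted burden test sketch: collapse Beta(1, 25)-weighted genotypes
# into one burden score per subject, then test it in a regression.
set.seed(3)
n <- 1000; m <- 20
maf <- runif(m, 0.001, 0.01)                     # rare-variant frequencies
G   <- sapply(maf, function(f) rbinom(n, 2, f))  # n x m genotype matrix
w   <- dbeta(maf, 1, 25)                         # variant weights
burden <- as.vector(G %*% w)                     # per-subject burden score
y <- rnorm(n)                                    # phenotype under the null
p <- summary(lm(y ~ burden))$coefficients["burden", "Pr(>|t|)"]
```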
Results. Save the summary statistics.
# save results
sumstats = data.frame(Gene = gene,
Chr = paste0(chr),
Start = paste0(start),
End = paste0(end),
p.Cauchy.all = fit$p.KS,
p.Cauchy.common = fit$p.KS.common,
p.Cauchy.rare = fit$p.KS.rare,
fit$p.individual,
stringsAsFactors = F,
check.names = F)
kable(select(sumstats, Gene:p.Cauchy.all),
col.names = c("Gene", "Chr", "Start", "End", "P-value"),
caption = paste0("**Summary statistics for ", gene,"**"), digits = 10,
format.args = list(scientific = T))
| Gene | Chr | Start | End | P-value |
|---|---|---|---|---|
| LDLR | 19 | 11200228 | 11241989 | 8e-10 |
write.table(x = sumstats,
file = paste0("demo.sumstats.tsv"), sep = "\t",
quote = F,
row.names = F,
col.names = T)
Shang, N, et al. 2019. “Medical Records-Based Kidney Disease Phenotype for ‘Big Data’ Observational and Genetic Studies.” bioRxiv.
Aschard, Hugues, Bjarni J Vilhjálmsson, Nicolas Greliche, Pierre-Emmanuel Morange, David-Alexandre Trégouët, and Peter Kraft. 2014. “Maximizing the Power of Principal-Component Analysis of Correlated Phenotypes in Genome-Wide Association Studies.” The American Journal of Human Genetics 94 (5): 662–76.
Bastarache, Lisa, Jacob J Hughey, Scott Hebbring, Joy Marlo, Wanke Zhao, Wanting T Ho, Sara L Van Driest, et al. 2018. “Phenotype Risk Scores Identify Patients with Unrecognized Mendelian Disease Patterns.” Science 359 (6381): 1233–9.
CHOP Phenotyping Group. 2014. “Gastroesophageal Reflux Disease (GERD) Phenotype Algorithm.” PheKB.
Bielinski, Suzette J. Mayo Clinic. 2013. “Heart Failure (HF) with Differentiation Between Preserved and Reduced Ejection Fraction.” PheKB.
Consortium, and others. 2019. “Harmonizing Clinical Sequencing and Interpretation for the eMERGE III Network.” American Journal of Human Genetics 105 (3): 588–605.
Carlson, Chris. Group Health Cooperative. 2012. “Dementia.” PheKB.
Denny, Joshua C, Lisa Bastarache, Marylyn D Ritchie, Robert J Carroll, Raquel Zink, Jonathan D Mosley, Julie R Field, et al. 2013. “Systematic Comparison of Phenome-Wide Association Study of Electronic Medical Record Data and Genome-Wide Association Study Data.” Nature Biotechnology 31 (12): 1102.
Denny, Josh, and Melissa Basford. Vanderbilt University. 2012. “Type 2 Diabetes - Demonstration Project.” PheKB.
Green, Robert C, Jonathan S Berg, Wayne W Grody, Sarah S Kalia, Bruce R Korf, Christa L Martin, Amy L McGuire, et al. 2013. “ACMG Recommendations for Reporting of Incidental Findings in Clinical Exome and Genome Sequencing.” Genetics in Medicine 15 (7): 565–74.
Ionita-Laza, Iuliana, Kenneth McCallum, Bin Xu, and Joseph D Buxbaum. 2016. “A Spectral Approach Integrating Functional Genomic Annotations for Coding and Noncoding Variants.” Nature Genetics 48 (2): 214.
Johnstone, Iain M, and others. 2001. “On the Distribution of the Largest Eigenvalue in Principal Components Analysis.” The Annals of Statistics 29 (2): 295–327.
Khera, Amit V, Mark Chaffin, Krishna G Aragam, Mary E Haas, Carolina Roselli, Seung Hoan Choi, Pradeep Natarajan, et al. 2018. “Genome-Wide Polygenic Scores for Common Diseases Identify Individuals with Risk Equivalent to Monogenic Mutations.” Nature Genetics 50 (9): 1219.
Liu, Yaowu, and Jun Xie. 2020. “Cauchy Combination Test: A Powerful Test with Analytic P-Value Calculation Under Arbitrary Dependency Structures.” Journal of the American Statistical Association 115 (529): 393–402.
Liu, Zhonghua, and Xihong Lin. 2019. “A Geometric Perspective on the Power of Principal Component Association Tests in Multiple Phenotype Studies.” Journal of the American Statistical Association, 1–32.
Niku, Jenni, Wesley Brooks, Riki Herliansyah, Francis KC Hui, Sara Taskinen, and David I Warton. 2019. “Efficient Estimation of Generalized Linear Latent Variable Models.” PloS One 14 (5).
Lee, S, MJ Emond, MJ Bamshad, et al. 2012. “Optimal Unified Approach for Rare-Variant Association Testing with Application to Small-Sample Case-Control Whole-Exome Sequencing Studies.” Am J Hum Genet 91 (2): 224–37.
Sammel, Mary Dupuis, Louise M Ryan, and Julie M Legler. 1997. “Latent Variable Models for Mixed Discrete and Continuous Outcomes.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59 (3): 667–78.
Wainwright, Martin J, Michael I Jordan, and others. 2008. “Graphical Models, Exponential Families, and Variational Inference.” Foundations and Trends in Machine Learning 1 (1–2): 1–305.
He, Z, B Xu, S Lee, and I Ionita-Laza. 2017. “Unified Sequence-Based Association Tests Allowing for Multiple Functional Annotations and Meta-Analysis of Noncoding Variation in Metabochip Data.” The American Journal of Human Genetics 101: 340–52.